Multi-field Correlated Topic Modeling

Authors

  • Konstantin Salomatin
  • Yiming Yang
  • Abhimanyu Lad
Abstract

Popular methods for probabilistic topic modeling, such as Latent Dirichlet Allocation (LDA, [1]) and Correlated Topic Models (CTM, [2]), share an important property: they use a single common set of topics to model all the data. This property can be too restrictive for modeling complex data entries in which multiple fields of heterogeneous data jointly provide rich information about each object or event. We propose a new extension of the CTM method that enables modeling with multi-field topics in a global graphical structure, together with a mean-field variational algorithm that allows joint learning of multinomial topic models for discrete data and Gaussian-style topic models for real-valued data. We conducted experiments with both simulated and real data, and observed that the multi-field CTM outperforms a conventional CTM in both likelihood maximization and perplexity reduction. A deeper analysis of the simulated data reveals that the superior performance results from the successful discovery of the mapping among field-specific topics and observed data.
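The abstract does not spell out the generative process, so the following is only a minimal sketch of one plausible reading, not the authors' model or their variational algorithm. It assumes two fields (one discrete, one real-valued), a single logistic-normal vector whose covariance couples the field-specific topics, and per-field softmax proportions. All names (`generate_document`, `CHOL`, `WORD_TOPICS`, `GAUSS_TOPICS`) and all parameter values are hypothetical.

```python
import math
import random


def softmax(xs):
    """Map real-valued scores to a probability vector."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sample_correlated_normal(mu, chol, rng):
    """Draw eta = mu + L z, with L a lower-triangular Cholesky factor."""
    z = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + sum(chol[i][j] * z[j] for j in range(i + 1))
            for i, m in enumerate(mu)]


def generate_document(rng, mu, chol, k_text, word_topics, gauss_topics,
                      n_words, n_values):
    """Sample one two-field document (assumed structure, see lead-in)."""
    # One global Gaussian draw covers the topics of BOTH fields, so the
    # covariance can correlate a text topic with a real-valued topic.
    eta = sample_correlated_normal(mu, chol, rng)
    theta_text = softmax(eta[:k_text])   # proportions for the discrete field
    theta_real = softmax(eta[k_text:])   # proportions for the real-valued field

    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(theta_text)), weights=theta_text)[0]
        # Multinomial topic: a distribution over vocabulary indices.
        w = rng.choices(range(len(word_topics[z])), weights=word_topics[z])[0]
        words.append(w)

    values = []
    for _ in range(n_values):
        z = rng.choices(range(len(theta_real)), weights=theta_real)[0]
        # Gaussian topic: a (mean, standard deviation) pair.
        mean, std = gauss_topics[z]
        values.append(rng.gauss(mean, std))
    return words, values


# Hypothetical parameters: 2 text topics, 2 real-valued topics.
MU = [0.0, 0.0, 0.0, 0.0]
# Cholesky factor giving correlation 0.8 between text topic i and real topic i.
CHOL = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.8, 0.0, 0.6, 0.0],
    [0.0, 0.8, 0.0, 0.6],
]
WORD_TOPICS = [[0.7, 0.2, 0.05, 0.05], [0.05, 0.05, 0.2, 0.7]]
GAUSS_TOPICS = [(-5.0, 1.0), (5.0, 1.0)]

if __name__ == "__main__":
    doc_words, doc_values = generate_document(
        random.Random(0), MU, CHOL, 2, WORD_TOPICS, GAUSS_TOPICS, 8, 4)
    print(doc_words, doc_values)
```

In this sketch the only link between the two fields is the shared covariance, which is one simple way to let field-specific topics co-occur; a conventional CTM would instead use one topic set for all observations.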

Similar resources

Automatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation

Purpose: This study investigates the automatic keyword extraction from the table of contents of Persian e-books in the field of science using LDA topic modeling, evaluating their similarity with golden standard, and users' viewpoints of the model keywords. Methodology: This is a mixed text-mining research in which LDA topic modeling is used to extract keywords from the table of contents of sci...

Topic Modeling and Classification of Cyberspace Papers Using Text Mining

The global cyberspace networks provide individuals with platforms to interact, exchange ideas, share information, provide social support, conduct business, create artistic media, play games, engage in political discussions, and much more. The term cyberspace has become a conventional means to describe anything associated with the Internet and the diverse Internet culture. In fact, cyberspac...

Determination of geochemical anomalies and gold mineralized stages based on litho-geochemical data for Zarshuran Carlin-like gold deposit (NW Iran) utilizing multi-fractal modeling and stepwise factor analysis

The Zarshuran Carlin-like gold deposit is located at the Takab Metallogenic belt in the northern part of the Sanandaj-Sirjan zone, NW Iran. The high-grade ore bodies are mainly hosted by black shale and cream to gray massive limestone along the NNE-trending extensional fault/fracture zones. The aim of this investigation was to determine and separate the gold mineralized stages based on the surf...

Performance comparison of finite-difference modeling on Cell, FPGA and multi-core computers

How does the performance of Cell, field-programmable gate array (FPGA), and multi-core computers compare for finite-difference modeling of the acoustic wave equation? In this paper I answer this question by assessing implementations on each of these architectures. Results show that on average, 7.49, 5.01, and 3.74 GFLOPs were sustained, respectively, by the FPGA, quad-core, and Cell machines for...

Incorporating Word Correlation Knowledge into Topic Modeling

This paper studies how to incorporate the external word correlation knowledge to improve the coherence of topic modeling. Existing topic models assume words are generated independently and lack the mechanism to utilize the rich similarity relationships among words to learn coherent topics. To solve this problem, we build a Markov Random Field (MRF) regularized Latent Dirichlet Allocation (LDA) ...

Publication date: 2009